By Brian Keegan, Ph.D. -- October 4, 2014
Released under a CC-BY-SA 3.0 License.
Import the libraries we'll use throughout the analysis right away.
In [3]:
# Standard packages for data analysis
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# pandas handles tabular data
import pandas as pd
# networkx handles network data
import networkx as nx
# json handles reading and writing JSON data
import json
# To visualize webpages within this webpage
from IPython.display import HTML
# To run queries against MediaWiki APIs
from wikitools import wiki, api
# Some other helper functions
from collections import Counter
from operator import itemgetter
We are going to make a query using the list=users parameter. The api.php page contains documentation for all the different queries you can run against Wikipedia's MediaWiki API. For our first test query, we want information about a single user: search that page for "list=users". You can also find similar information about this specific query here in the general MediaWiki documentation.
We can actually write a test query as a URL, which will return results if the parameters are all valid. Using the example given on the api.php documentation page:
http://en.wikipedia.org/w/api.php?action=query&list=users&ususers=Madcoverboy|Jimbo_Wales&usprop=blockinfo|groups|editcount|registration|gender
There are four parameters in this API call, separated by & signs:
- action - We pass a query option here to differentiate it from other actions we can run on the API, like parse. But action=query will be what we use much of the time.
- list - This is one of several parameters we can use to make a query; search for "action=query" for others besides list. We pass a users option to list because we want to generate information about users. This lets us run the sub-options detailed in the documentation below.
- ususers - Here we list the names of the Wikipedia users we want to get information about. We can pass more than one name by adding a pipe "|" between names. The documentation says we can only pass up to 50 names per request. Here we pass two names: Madcoverboy for yours truly and Jimbo_Wales for the founder of Wikipedia.
- usprop - Here we pass a list of options, detailed under list=users, about the information we can obtain about any user. Again we use pipes to connect multiple options together. We are going to get information about whether a user is currently blocked (blockinfo), what powers the user has (groups), their total number of edits (editcount), the date and time they registered their account (registration), and their self-reported gender (gender).
In summary, this API request is going to perform a query action that expects us to pass a list of user names and will return information about those users. We have given the query the names of the users we want information about as well as the specific types of information we want about each of them.
The code block below shows what clicking the URL should return.
In [27]:
HTML('http://en.wikipedia.org/w/api.php?action=query&list=users&ususers=Madcoverboy|Jimbo_Wales&usprop=blockinfo|groups|editcount|registration|gender')
Out[27]:
There's a lot of padding and fields from the XML markup this returns by default, but the data are all in there. My userid is "304994", my username is "Madcoverboy" (which we already knew), I have 12,348 edits, I registered my account on June 21, 2005 at 1:52:16pm GMT, I identify as male, and I'm a member of four "groups" corresponding to my editing privileges: reviewer, *, user, and autoconfirmed.
Clicking on the link will run the query and return the results in your web browser. However, the point of using an API is not for you to make queries with a URL and then copy-paste the results into Python. We're going to run the query within Python (rather than the web browser) and return the data back to us in a format that we can continue to use for analysis.
First we need to write a function that will accept something that corresponds to the query we want to run, goes out and connects to the English Wikipedia's MediaWiki API "spigot", formats our query for this API to understand, runs the query until all the results come back, and then returns the results to us as some data object. The function below does all of those things, but it's best to just treat it as a black box for now that accepts queries and spits out the results from the English Wikipedia.
(If you want to use another MediaWiki API, replace the current URL following site_url with the corresponding API location. For example, Memory Alpha's is http://en.memory-alpha.org/api.php)
In [5]:
def wikipedia_query(query_params,site_url='http://en.wikipedia.org/w/api.php'):
    site = wiki.Wiki(url=site_url)
    request = api.APIRequest(site, query_params)
    result = request.query()
    return result[query_params['action']]
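As an aside, the site_url keyword argument above is all you would change to talk to a different wiki. Purely as a hedged illustration (meta=siteinfo is a generic MediaWiki query module, so no user names are needed), the same helper could ask Memory Alpha for its basic site information:
# Illustration only: point the helper at Memory Alpha's MediaWiki API
ma_info = wikipedia_query({'action': 'query',
                           'meta': 'siteinfo',
                           'siprop': 'general'},
                          site_url='http://en.memory-alpha.org/api.php')
# ma_info['general']['sitename'] should report the wiki's name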
We can write the exact same query as we used above by putting all the same request parameters into a dictionary as key-value pairs and saving the dictionary as user_query. For example, where we used action=query in the URL above, we use 'action':'query' as a key-value pair of strings (make sure to include the quotes marking these as strings rather than variables!) in the query dictionary. Then we can pass this query dictionary to the wikipedia_query black box function defined above to get the exact same information out. We save the output in query_results and can look at the results by calling this variable.
In [6]:
user_query = {'action':'query',
'list':'users',
'usprop':'blockinfo|groups|editcount|registration|gender',
'ususers':'Madcoverboy|Jimbo Wales'}
query_results = wikipedia_query(user_query)
query_results
Out[6]:
The data structure that is returned is a dictionary keyed by 'users'
which returns a list of dictionaries. Knowing that the data corresponding to Jimbo Wales
is the second element in the list of dictionaries (remember Python indices start at 0, so the 2nd element corresponds to 1), we can access his edit count.
In [7]:
query_results['users'][1]['editcount']
Out[7]:
Instead of writing each query manually, we can define a function get_user_properties that accepts a user name (or names) and returns the results of the query used above, replacing "Madcoverboy" and "Jimbo Wales" with the user name(s) passed.
In [8]:
def get_user_properties(user):
    result = wikipedia_query({'action':'query',
                              'list':'users',
                              'usprop':'blockinfo|groups|editcount|registration|gender',
                              'ususers':user})
    return result
We can test this function on another user, "Koavf", who is the most active user on the English Wikipedia. We'll save his results to koavf_query_results.
In [9]:
koavf_query_results = get_user_properties('Koavf')
koavf_query_results
Out[9]:
All the data we've collected in query_results
and koavf_query_results
exist only in memory. Once we shut this notebook down, these data will cease to exist. So we'll want to save these data to disk by "serializing" into a format that other programs can use. Two very common file formats are JavaScript Object Notation (JSON) and Comma-separated Values (CSV). JSON is better for more complex data that contains a mixture of strings, arrays (lists), dictionaries, and booleans while CSV is better for "flatter" data that you might want to read into a spreadsheet.
We can save koavf_query_results as a JSON file by creating and opening a file named koavf_query_results.json and referring to this connection as f. We use the json.dump function to translate all the data in the koavf_query_results dictionary into the file, and once the with block finishes, the file is automatically closed so that other programs can access it.
In [10]:
with open('koavf_query_results.json','wb') as f:
    json.dump(koavf_query_results,f)
Check to make sure this data was properly exported by reading it back in as loaded_koavf_query_results
.
In [11]:
with open('koavf_query_results.json','rb') as f:
    loaded_koavf_query_results = json.load(f)
loaded_koavf_query_results
Out[11]:
The query_results data has two "observations" corresponding to "Madcoverboy" and "Jimbo Wales". We could create a CSV with the columns corresponding to the field names (editcount, gender, groups, name, registration, userid) and then two rows containing the corresponding values for each user.
Using a powerful library called "pandas" (short for "panel data", not the cute bears), we can pass the list of data inside query_results and pandas will attempt to convert it to a tabular format called a DataFrame that can be exported to CSV. We save this as df and then use the to_csv function to write this DataFrame to a CSV file. We use two extra options: declaring a quote character to make sure the data in groups, which already contains commas, doesn't get split up later, and index=False because we don't care about exporting the row numbers (the index).
In [12]:
query_results['users']
Out[12]:
In [13]:
df = pd.DataFrame(query_results['users'])
df.to_csv('query_results.csv',quotechar='"',index=False)
df
Out[13]:
Check to make sure this data was properly exported by reading it back in.
In [14]:
pd.read_csv('query_results.csv',quotechar='"')
Out[14]:
In this section, we've covered the basics of how to:
- write a MediaWiki API query both as a URL and as a Python dictionary
- run queries from Python with the wikipedia_query function and navigate the data they return
- save the results to disk as JSON and CSV files
In the next sections, we'll use other queries to get more interesting data about relationships and more advanced data manipulation techniques to prepare these data for social network analysis.
We are going to use the prop=links query to identify the list of articles that are currently linked from an article. We will use the article for "Hillary Rodham Clinton". The general MediaWiki documentation for this query is here. We will specify a query using action=query to define the general class of query, prop=links to indicate we want the current links from a page, and titles=Hillary Rodham Clinton to pass the name of the page.
There are many "namespaces" of Wikipedia pages that reflect different kinds of pages: articles, article talk pages, user pages, user talk pages, and other administrative pages. Links to and from a Wikipedia article can come from all of these namespaces, but because the Wikipedia articles that 99% of us ever read live inside the "0" namespace, we'll limit ourselves to links in that namespace rather than these "backchannel" links. We enforce this limit with the plnamespace=0 option.
There could potentially be hundreds of links from a single article, but the API will only return some number per request. The wikitools library takes care of automatically generating additional requests if there is more data to obtain after the first request. Ideally, we could specify a large number like 10,000 to make sure we get all the links with a single request, but the API enforces a limit of 500 links per request and defaults to only 10 per request. We use pllimit=500 to make sure we get the maximum number of links per request instead of issuing 50 requests.
In [16]:
outlink_query = {'action': 'query',
'prop': 'links',
'titles': 'Hillary Rodham Clinton',
'pllimit': '500',
'plnamespace':'0'}
hrc_outlink_data = wikipedia_query(outlink_query)
In [78]:
hrc_outlink_data['pages'][u'5043192']['links'][:5]
Out[78]:
The data returned by this query is a dictionary of dictionaries that you'll need to dive "into" more deeply to access the data itself. The top dictionary contains a single key 'pages'
which returns a dictionary containing a single key u'5043192'
corresponding to the page ID for the article. Once you're inside this dictionary, you can access the list of links, which are unfortunately a list of dictionaries! Using something called a "list comprehension", I can clean this data up to get a nice concise list of links, which we save as hrc_outlink_list
. I also print out the number of links in this list and 10 examples of these links.
In [19]:
hrc_outlink_list = [link['title'] for link in hrc_outlink_data['pages'][u'5043192']['links']]
print "There are {0} links from the Hillary Rodham Clinton article".format(len(hrc_outlink_list))
hrc_outlink_list[:10]
Out[19]:
Note that there is an article for "Hillary Clinton" as well, but this article is a redirect. In other words, this article exists and has data that can be accessed from the API, but it's suspiciously sparse and just points to "Hillary Rodham Clinton".
In [20]:
outlink_query_hc = {'action': 'query',
'prop': 'links',
'titles': 'Hillary Clinton',
'pllimit': '500',
'plnamespace': '0'}
hc_outlink_data = wikipedia_query(outlink_query_hc)
hc_outlink_data
Out[20]:
The MediaWiki API has a redirects
option that lets us ignore these placeholder redirect pages and will follow the redirect to take us to the intended page. Adding this option to the query but specifying the same Hillary Clinton
value for the titles
parameter that previously led to a redirect now returns all the data at the "Hillary Rodham Clinton" article. We'll make sure to use this redirects
option in future queries.
In [21]:
outlink_query_hc_redirect = {'action': 'query',
'prop': 'links',
'titles': 'Hillary Clinton', # still "Hillary Clinton"
'pllimit': '500',
'plnamespace': '0',
'redirects': 'True'} # redirects parameter added
hcr_outlink_data = wikipedia_query(outlink_query_hc_redirect)
hcr_outlink_list = [link['title'] for link in hcr_outlink_data['pages'][u'5043192']['links']]
print "There are {0} links from the Hillary Clinton article".format(len(hcr_outlink_list))
hcr_outlink_list[:10]
Out[21]:
We are going to use the prop=linkshere query to identify the articles that currently link to the Hillary Rodham Clinton article. The parameters for this query are a bit different. We still limit ourselves to pages in the article namespace by specifying lhnamespace=0, and we maximize the number of links returned per request by specifying lhlimit=500. However, we don't want to include redirects that point to this article (e.g., "Hillary Clinton" points to "Hillary Rodham Clinton"), so we specify lhshow=!redirect. Finally, we only want the names of the linking articles rather than less important information like "pageid" or "redirects", so we limit the output by specifying lhprop=title.
In [22]:
inlink_query_hrc = {'action': 'query',
'redirects': 'True',
'prop': 'linkshere',
'titles': 'Hillary Rodham Clinton',
'lhlimit': '500',
'lhnamespace': '0',
'lhshow': '!redirect',
'lhprop': 'title'}
hrc_inlink_data = wikipedia_query(inlink_query_hrc)
Again some data processing and cleanup is necessary to drill down into the dictionaries of dictionaries to extract the list of links from the data returned by the query. I use a similar list comprehension as above to get this list of links out. Again, I count the number of links in this list and give an example of 10 links.
In [24]:
hrc_inlink_list = [link['title'] for link in hrc_inlink_data['pages'][u'5043192']['linkshere']]
print "There are {0} links to the Hillary Rodham Clinton article".format(len(hrc_inlink_list))
hrc_inlink_list[:10]
Out[24]:
In the previous two sections, we came up with two separate queries to get both the links from an article and the links to an article. However, much to the credit of the MediaWiki API engineers, you can combine both queries into one. We'll need all the same parameter information that we had included before (pllimit
, lhlimit
, etc.), but we can combine the queries together by combining prop=links
and prop=linkshere
with a pipe (like we did with user names in the very first query), prop=links|linkshere
.
In [25]:
alllinks_query_hrc = {'action': 'query',
'redirects': 'True',
'prop': 'links|linkshere', #combined both prop calls with a pipe
'titles': 'Hillary Rodham Clinton',
'pllimit': '500', #still need the "prop=links" "pl" parameters and below
'plnamespace': '0',
'lhlimit': '500', #still need the "prop=linkshere" "lh" parameters and below
'lhnamespace': '0',
'lhshow': '!redirect',
'lhprop': 'title'}
hrc_alllink_data = wikipedia_query(alllinks_query_hrc)
Again, we need to do some data processing and cleanup to get the lists of links out. However, there are now two different sub-dictionaries within the hrc_alllink_data object, reflecting the output from the links and linkshere calls.
In [26]:
hrc_alllink_outlist = [link['title'] for link in hrc_alllink_data['pages'][u'5043192']['links']]
hrc_alllink_inlist = [link['title'] for link in hrc_alllink_data['pages'][u'5043192']['linkshere']]
print "There are {0} out links from and {1} in links to the Hillary Rodham Clinton article".format(len(hrc_alllink_outlist),len(hrc_alllink_inlist))
We can also write a function get_article_links
that takes an article name as an input and returns the lists containing all the in and out links for that article. We use the combined query described above, but replace Hillary's article title with a generic article
variable, run the query, pull out the page_id
, and then do the data processing and cleanup to produce a list of outlinks and a list of inlinks, both of which are passed back out of the function. Again, this query will only pull out the current links on the article, not historical links.
In [27]:
def get_article_links(article):
    query = {'action': 'query',
             'redirects': 'True',
             'prop': 'links|linkshere',
             'titles': article, # the article variable is passed into here
             'pllimit': '500',
             'plnamespace': '0',
             'lhlimit': '500',
             'lhnamespace': '0',
             'lhshow': '!redirect',
             'lhprop': 'title'}
    results = wikipedia_query(query) # do the query
    page_id = results['pages'].keys()[0] # get the page_id
    if 'links' in results['pages'][page_id].keys(): #sometimes there are no links
        outlist = [link['title'] for link in results['pages'][page_id]['links']] # clean up outlinks
    else:
        outlist = [] # return empty list if no outlinks
    if 'linkshere' in results['pages'][page_id].keys(): #sometimes there are no links
        inlist = [link['title'] for link in results['pages'][page_id]['linkshere']] # clean up inlinks
    else:
        inlist = [] # return empty list if no inlinks
    return outlist,inlist
We can test this on Bill Clinton's article, for example.
In [28]:
bc_out, bc_in = get_article_links("Bill Clinton")
print "There are {0} out links from and {1} in links to the Bill Clinton article".format(len(bc_out),len(bc_in))
Let's put the data for both of these queries into a dictionary called clinton_link_data so it's easier to access and save. We'll also save this data to disk as JSON so we can access it in the future.
In [30]:
clinton_link_data = {"Hillary Rodham Clinton": {"In": hrc_alllink_inlist,
"Out": hrc_alllink_outlist},
"Bill Clinton": {"In": bc_in,
"Out": bc_out}
}
with open('clinton_link_data.json','wb') as f:
    json.dump(clinton_link_data,f)
Having collected data about the neighboring articles that are linked to or from one article, we can turn these data into a network. Using the NetworkX
library (shortened to nx
on import at the top), we will create a DiGraph
object called hrc_g
and then fill it with the connection data we just collected. We do this by iterating over the lists of links (hrc_alllink_outlist
and hrc_alllink_inlist
) and adding a directed edge between each neighbor and the original article. It's important to pay attention to edge direction as the out links should start at "Hillary Rodham Clinton" and end at the neighboring article whereas the in links should start at the neighboring article and end at "Hillary Rodham Clinton".
In [31]:
hrc_alllink_outlist[:5]
Out[31]:
In [32]:
hrc_g = nx.DiGraph()
for article in hrc_alllink_outlist:
    hrc_g.add_edge("Hillary Rodham Clinton",article)
for article in hrc_alllink_inlist:
    hrc_g.add_edge(article,"Hillary Rodham Clinton")
We can compute some basic statistics about the network such as the number of nodes.
In [34]:
len(hrc_alllink_outlist) + len(hrc_alllink_inlist)
Out[34]:
In [33]:
hrc_g.number_of_nodes()
Out[33]:
In [122]:
print "There are {0} edges and {1} nodes in the network".format(hrc_g.number_of_edges(), hrc_g.number_of_nodes())
We might also ask how many of these hyperlink edges are reciprocated, or link in both directions. We start with an empty container reciprocal_edges that we'll fill with edges that are reciprocated. Next, we iterate through all the edges in the graph (hrc_g.edges() returns a list of all edges) and check two things. The first check is whether the graph contains an edge that goes in the opposite direction: given an edge (i,j), we check if there's also a (j,i). The second check is to make sure we haven't already added this edge to the reciprocal_edges list. If both conditions are true, then we add the edge to reciprocal_edges.
In [35]:
reciprocal_edges = list()
for (i,j) in hrc_g.edges():
    if hrc_g.has_edge(j,i) and (j,i) not in reciprocal_edges:
        reciprocal_edges.append((i,j))
reciprocation_fraction = round(float(len(reciprocal_edges))/hrc_g.number_of_edges(),3)
print "There are {0} reciprocated edges out of {1} edges in the network, giving a reciprocation fraction of {2}.".format(len(reciprocal_edges),hrc_g.number_of_edges(),reciprocation_fraction)
We can compare this to the network for Bill Clinton. There are many more edges in his network, but a much smaller fraction of these edges are reciprocated. This suggests that there are fewer articles expressing some similarity or relationship with Bill Clinton that his article also acknowledges by linking back, which in turn invites questions about why the two articles' linking behaviors differ. With the query we've covered above, you can begin to answer these open questions.
In [36]:
bc_g = nx.DiGraph()
for article in bc_out:
    bc_g.add_edge("Bill Clinton",article)
for article in bc_in:
    bc_g.add_edge(article,"Bill Clinton")
bc_reciprocal_edges = list()
for (i,j) in bc_g.edges():
    if bc_g.has_edge(j,i) and (j,i) not in bc_reciprocal_edges:
        bc_reciprocal_edges.append((i,j))
bc_reciprocation_fraction = round(float(len(bc_reciprocal_edges))/bc_g.number_of_edges(),3)
print "There are {0} reciprocated edges out of {1} edges in the network, giving a reciprocation fraction of {2}.".format(len(bc_reciprocal_edges),bc_g.number_of_edges(),bc_reciprocation_fraction)
This is a pretty basic "star"-shaped network that contains Hillary's article at the center, surrounded by all the articles linking to and from it. In particular, we could "snowball" out from the articles that link to and are linked from a given page, visit each of those articles, and create their local networks. We could continue to do this until we traverse the whole hyperlink network, but that would take a very long time, involve a lot of data, and would be an abusive use of the API (if you want the whole hyperlink network, you can download the data directly here by clicking a backup date and searching for "Wiki page-to-page link records.").
We could also create the "1.5-step ego" hyperlink network around a given page that consists of the focal article, all the articles that link to or from it, and then whether these neighboring articles are linked to each other. This could provide a better picture of which neighboring articles link to which other articles.
Unfortunately, even the scrape for the 2-step ego hyperlink network could take over an hour of data collection and generate hundreds of megabytes of data. Furthermore, Wikipedia articles also contain templates, which create lots of "redundant" links between articles that share templates even though these links don't appear in the body of the article itself. You'd need to do much more advanced parsing of the wiki-markup to get only the links that appear in the body of an article, but that's beyond the scope of the present tutorial.
I don't recommend crawling more than the immediate (1-step) neighbors of Wikipedia articles.
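That said, if you did want the 1.5-step ego network described above, a sketch using the get_article_links function from earlier might look like the following. It is left unexecuted here because it issues a pair of API requests for every one of the thousands of neighboring articles:
# Sketch only: 1.5-step ego network around the focal article
focal = u'Hillary Rodham Clinton'
neighbors = set(hrc_alllink_outlist) | set(hrc_alllink_inlist)
ego_g = nx.DiGraph()
for article in hrc_alllink_outlist:
    ego_g.add_edge(focal,article)
for article in hrc_alllink_inlist:
    ego_g.add_edge(article,focal)
for article in neighbors:
    out_links, in_links = get_article_links(article) # slow: queries every neighbor
    for target in out_links:
        if target in neighbors: # keep only links among the neighbors
            ego_g.add_edge(article,target)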
The queries above only looked at the links coming from the current version of the article. However Wikipedia archives every version of the article, so we can rewind the tape all the way back to the first version of Hillary's article back in 2001, a few months after Wikipedia was created. Specific versions of a Wikipedia article are identified with a revid
, which is also called an oldid
in some contexts. In subsequent sections, we'll go into more detail on how to get a list of all revisions to an article and find the oldest revision. But for the time being, just trust me that revid
"256189" is the oldest version of the Hillary Rodham Clinton article. Take a peek at what the article looked like back then below:
In [37]:
HTML('<iframe src=https://en.wikipedia.org/w/index.php?title=Hillary_Rodham_Clinton&oldid=256189&useformat=mobile width=700 height=350></iframe>')
Out[37]:
The MediaWiki API allows us to extract the out links from this old version of the article. Here we'll perform a different kind of action on the API than the query action we've used so far. The action=parse will extract information, such as the links, from a given version of an article. We specify that links should be parsed out with the prop=links parameter. Finally, we pass oldid=256189 so that this specific revision is parsed.
In [38]:
oldest_outlinks_query_hrc = {'action': 'parse', #query changes to parse
'prop': 'links',
'oldid': '256189'}
oldest_outlinks_data = wikipedia_query(oldest_outlinks_query_hrc)
oldest_outlinks_data
Out[38]:
Again, data processing and cleanup using a list comprehension is necessary to get a list of links from this result.
In [86]:
oldest_outlink_list = [link['*'] for link in oldest_outlinks_data['links']]
print "There are {0} out links from the Hillary Rodham Clinton article".format(len(oldest_outlink_list))
oldest_outlink_list
Out[86]:
So now we can also extract links from historical versions of the article. However, it's much more difficult to get the history of what links in to an article (e.g., linkshere
) as this would require potentially looking at the history of every other article to check if a link was ever made from that article to another article. This is not impossible, just very very time-consuming.
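To make the cost concrete, here is a hedged sketch of the check you would have to run against every revision of every candidate article, reusing the parse query from above (revision_links_to is a name introduced here just for illustration):
# Sketch: does a given old revision of some other article link to the target?
def revision_links_to(oldid, target=u'Hillary Rodham Clinton'):
    parsed = wikipedia_query({'action': 'parse',
                              'prop': 'links',
                              'oldid': str(oldid)})
    return target in [link['*'] for link in parsed['links']]
# e.g. revision_links_to(256189) runs the check on the oldest HRC revision itself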
In this section we learned to write and combine queries that get the links to and from the current version of an article, clean the output of these queries into lists of links, use these lists to build a network object, and do some preliminary analysis of an article's ego network. There are some limitations on the specificity of the links that the API passes back, which limits our ability to generate more complex networks using this query. We also showed that it's possible to get the out links from a historical version of an article using a new kind of API action called parse. Using the out links from all the changes to an article could let us look at the evolution of what the article linked to over time. We'll go into how to get all the changes to an article in the next section.
The previous section showed how to make a basic network from the current hyperlinks to and from a Wikipedia article. It also alluded to the fact that Wikipedia captures the history of every change made to the article since it was created as well as who made these changes and when (among other meta-data). In this section, we'll explore some queries around how to extract the "revision history" of an article from the API. We'll do some exploratory analysis using these data to understand patterns in the distribution of editors' activity, changes in content, and the persistence of revisions. Additionally, we'll construct a co-authorship network of what editors made a change to the article.
Starting with a basic query, we'll get every change that's been made to the "Hillary Rodham Clinton" article. We'll use action=query and prop=revisions to get the list of changes to an article (see the detailed documentation here). There are many options to specify. We pass several options to rvprop to get the revision IDs, timestamp, user, user ID, revision comment, and the size of the article; rvlimit=500 so each request returns the maximum number of revisions (wikitools keeps requesting until all of them come back); and "newer" to rvdir so the revisions come back in chronological order (oldest to newest). There are many other options that could be specified, such as rvprop=content to get the content of each revision, rvstart and rvend to get revisions within a specific timeframe, or rvexcludeuser to omit changes from bots, for example.
In [135]:
revisions_query_hrc = {'action': 'query',
'redirects': 'True',
'prop': 'revisions',
'titles': "Hillary Rodham Clinton",
'rvprop': 'ids|user|timestamp|userid|comment|size',
'rvlimit': '500',
'rvdir': 'newer'}
revisions_data_hrc = wikipedia_query(revisions_query_hrc)
There's a lot of data in there, and you can already expect that we'll need to do some data processing and cleaning to get it into a more usable form:
- Extract the list of revisions and convert it into a DataFrame object that we'll call hrc_rv_df, adding a page column to make it clear which article is being edited.
- The values in the timestamp column of this new DataFrame are still strings rather than meaningful dates that we can sort on, so we convert them using the to_datetime function, passing the strftime formatting magic so that pandas knows which string sequences correspond to meaningful years, months, days, hours, minutes, and seconds values.
- The anon column has a strange mixture of NaNs and empty strings corresponding to whether the revision was made by a registered account or not. The replace method swaps the NaNs out for False and the empty strings for True booleans to make this more interpretable.
- Sort the rows on their timestamp values, reset the index (row numbers) so they correspond to the revision count, label this index as "revision", and set a (page, revision) MultiIndex.
- Save the data to disk as hrc_revisions.csv, making sure that we encode non-ASCII characters as "utf8". You'll want to make a habit out of doing this.
In [271]:
# Extract and convert to DataFrame
hrc_rv_df = pd.DataFrame(revisions_data_hrc['pages']['5043192']['revisions'])
# Make it clear what's being edited
hrc_rv_df['page'] = [u'Hillary Rodham Clinton']*len(hrc_rv_df)
# Clean up timestamps
hrc_rv_df['timestamp'] = pd.to_datetime(hrc_rv_df['timestamp'],format="%Y-%m-%dT%H:%M:%SZ",unit='s')
# Clean up anon column
hrc_rv_df = hrc_rv_df.replace({'anon':{np.nan:False,u'':True}})
# Sort the data on timestamp and reset the index
hrc_rv_df = hrc_rv_df.sort('timestamp').reset_index(drop=True)
hrc_rv_df.index.name = 'revision'
hrc_rv_df = hrc_rv_df.reset_index()
# Set the index to a MultiIndex
hrc_rv_df.set_index(['page','revision'],inplace=True)
# Save the data to disk
hrc_rv_df.to_csv('hrc_revisions.csv',encoding='utf8')
# Show the first 5 rows
hrc_rv_df.head()
Out[271]:
We might be interested in looking at the most active editors over the history of the article. We can perform a groupby operation that effectively creates a mini-DataFrame for each user's revisions. We use the aggregate function to collect information (len gets us the number of revisions they made) across all these mini-DataFrames, which returns a Series object indexed by username containing each user's number of revisions. Sorting these counts in descending order and looking at the top contributors shows variation across nearly two orders of magnitude.
In [272]:
hrc_rv_gb_user = hrc_rv_df.groupby('user')
hrc_user_revisions = hrc_rv_gb_user['revid'].aggregate(len).sort(ascending=False,inplace=False)
print "There are {0} unique users who have made a contribution to the article.".format(len(hrc_user_revisions))
hrc_user_revisions.head(10)
Out[272]:
Given the wide variation in the number of contributions per user, we can create a kind of "histogram" that plots how many users made how many revisions. Because there is so much variation in the data, we use logged axes. In the upper left, there are several thousand editors who made only a single contribution. In the lower right are the individual editors listed above who made several hundred revisions each to this article.
In [273]:
revisions_counter = Counter(hrc_user_revisions.values)
plt.scatter(revisions_counter.keys(),revisions_counter.values(),s=50)
plt.ylabel('Number of users',fontsize=15)
plt.xlabel('Number of revisions',fontsize=15)
plt.yscale('log')
plt.xscale('log')
We'll add some information to the DataFrame about the cumulative number of unique users who've ever edited the article. This should give us a sense of how the size of the collaboration changed over time. We start with two empty lists: unique_users, to which we add the name of each user the first time they make an edit, and unique_count, which records the number of unique users seen at each point in time. We then add the unique_count list to the DataFrame as the unique_users column.
In [307]:
def count_unique_users(user_series):
    unique_users = []
    unique_count = []
    for user in user_series.values:
        if user not in unique_users:
            unique_users.append(user)
            unique_count.append(len(unique_users))
        else:
            unique_count.append(unique_count[-1])
    return unique_count
hrc_rv_df['unique_users'] = count_unique_users(hrc_rv_df['user'])
We can look at changes to the contribution patterns on the article over time. First we need to do some data processing to convert the timestamps into generic dates. Then we group the activity together by date and use aggregate to create a new DataFrame called activity_by_day that contains the number of unique users and the number of revisions made on each day. Finally, we plot the distribution of this activity over time.
Looking at the blue line for the number of unique users, we see the collaboration is initially small through 2004, but then between 2005 and 2008 it undergoes rapid growth from a few hundred editors to over 3,000 editors. After 2008, however, the number of new users grows much more slowly and steadily. This is somewhat surprising, as this timeframe includes a number of historic events like Hillary's campaign for president in 2008 as well as her tenure as Secretary of State.
Looking at the green line for the number of revisions made per day, there is a lot of variation in daily editing activity, but much of it again occurs between 2005 and 2009 and slows down substantially thereafter. Peaks might correspond to major news events (like nominations) or to edit wars (editors fighting over content).
In [275]:
hrc_rv_df['date'] = hrc_rv_df['timestamp'].apply(lambda x:x.date())
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,'revid':len})
ax = activity_by_day.plot(lw=1,secondary_y=['revid'])
ax.set_xlabel('Time',fontsize=15)
Out[275]:
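To follow up on those peaks, one option is to pull out the busiest editing days and compare the dates against the news cycle. A small sketch using the same sorting pattern as earlier:
# The ten days with the most revisions -- candidates for news events or edit wars
activity_by_day['revid'].sort(ascending=False,inplace=False).head(10)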
We can also look at the distribution in changes to the article's size. In other words, how much content (in bytes) was introduced or removed from the article by an editor's changes? We see there is a very wide (axes are still on log scales) and mostly symmetrical distribution in additions and removals of content. In other words, the most frequent changes are extremely minor (-1 to 1 bytes) and the biggest changes (dozens of kilobytes) are very rare --- and likely the result of vandalism and reversion of vandalism. Nevertheless it's the case that this Wikipedia article's history is as much about the removal of content as it is about the addition of content.
In [276]:
hrc_rv_df['diff'] = hrc_rv_df['size'].diff()
diff_counter = Counter(hrc_rv_df['diff'].values)
plt.scatter(diff_counter.keys(),diff_counter.values(),s=50,alpha=.1)
plt.xlabel('Difference (bytes)',fontsize=15)
plt.ylabel('Number of revisions',fontsize=15)
plt.yscale('log')
plt.xscale('symlog')
Re-compute the activity_by_day DataFrame to include the diff variable computed above, using the np.median method to get the median change in the article on a given day. Substantively, this means we can track how much content was added or removed on each day. This signal is noisy, so we smooth it using rolling_mean with a 60-day window. There's a general tendency for the article to grow on any given day, but there are a few time periods when the article shrinks drastically, likely reflecting sections of the article being split out into sub-articles.
In [277]:
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,
'revid':len,
'diff':np.median})
# Compute a 60-day rolling average to remove spikiness, plot
pd.rolling_mean(activity_by_day['diff'],60).plot()
plt.yscale('symlog')
plt.xlabel('Time',fontsize=15)
plt.ylabel('Difference (bytes)',fontsize=15)
plt.axhline(0,lw=2,c='k')
Out[277]:
We can also explore how long an edit persists on the article before another edit is subsequently made. The average edit only persists for ~34,500 seconds (~9.5 hours) but the median edit only persists for 881 seconds (~15 minutes).
In [278]:
# The diff returns timedeltas, but dividing by a 1-second timedelta returns a float
# Round these numbers off to smooth out the distribution and add 1 second to everything to make the plot behave
hrc_rv_df['latency'] = [round(i/np.timedelta64(1,'s'),-1) + 1 for i in hrc_rv_df['timestamp'].diff().values]
diff_counter = Counter(hrc_rv_df['latency'].values)
plt.scatter(diff_counter.keys(),diff_counter.values(),s=50,alpha=.1)
plt.xlabel('Latency time (seconds)',fontsize=15)
plt.ylabel('Number of changes',fontsize=15)
plt.yscale('log')
plt.xscale('log')
In [279]:
hrc_rv_df['latency'].describe()
Out[279]:
As we did above, we can recompute activity_by_day
to include daily median changes in the latency between edits. There is substantial variation in how long edits persist. Again, the pre-2006 era is marked by content that goes days or weeks without changes, but between 2006 and 2009 the time between edits becomes much shorter, presumably corresponding with the attention around her presidential campaign. After 2008, the time between changes increases again and stabilizes at its (smoothed) current value of around 2 days between edits.
In [280]:
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,
'revid':len,
'diff':np.median,
'latency':np.median})
# Compute a 60-day rolling average to remove spikiness, plot
pd.rolling_mean(activity_by_day['latency'],60).plot()
plt.yscale('symlog')
plt.xlabel('Time',fontsize=15)
plt.ylabel('Latency time (seconds)',fontsize=15)
Out[280]:
We previously created a directed network of hyperlinks where the nodes were all articles and the edges indicated the direction of the link(s) between the central article and its neighbors. In this section, we're going to construct a different kind of network that contains a mixture of editors and articles, where the edges indicate whether an editor contributed to an article. For simplicity's sake, we're going to start with the 1-step ego co-authorship network containing the "Hillary Rodham Clinton" article and the set of editors who have ever made changes to it. Because there are two types of nodes in this network (articles and editors), and editors can't edit editors and articles can't edit articles, we call this network a "bipartite network" (also known as an "affiliation" or "two-mode" network).
Even though bipartite networks are traditionally undirected, we're going to use a directed network because NetworkX
does some wacky things when using an undirected network with bipartite properties. We're also going to make this a weighted network where the edges have values that correspond to the number of times an editor made a change to the article. This basically replicates the analysis we did above in "User Activity" but is an example of the information from the revision history that we might want to include in the network representation.
We go over every user in the user
column inside hrc_rv_df
and first check whether or not a (user
,"Hillary Rodham Clinton") edge exists. If one already exists, then we increment its weight
attribute by 1. Otherwise if there is no such edge in the network, we add a (user
,"Hillary Rodham Clinton") edge with a weight
of 1. We can inspect five of the edges to make sure this worked.
In [289]:
hrc_bg = nx.DiGraph()
for user in hrc_rv_df['user'].values:
    if hrc_bg.has_edge(user,u'Hillary Rodham Clinton'):
        hrc_bg[user][u'Hillary Rodham Clinton']['weight'] += 1
    else:
        hrc_bg.add_edge(user,u'Hillary Rodham Clinton',weight=1)
print "There are {0} nodes and {1} edges in the network.".format(hrc_bg.number_of_nodes(),hrc_bg.number_of_edges())
hrc_bg.edges(data=True)[:5]
Out[289]:
Based on everything we did in the previous analysis to query the revisions, reshape and clean up the data, and extract new features for analysis, we are now going to write a big function that does all of this automatically. The function get_revision_df
will accept an article name, perform the query, and proceed to do many of the steps outlined above, and returns a cleaned DataFrame at the end.
In [39]:
def get_revision_df(article):
    revisions_query = {'action': 'query',
                       'redirects': 'True',
                       'prop': 'revisions',
                       'titles': article,
                       'rvprop': 'ids|user|timestamp|userid|comment|size',
                       'rvlimit': '500',
                       'rvdir': 'newer'}
    revisions_data = wikipedia_query(revisions_query)
    page_id = revisions_data['pages'].keys()[0]
    # Extract and convert to DataFrame. Try/except for links to pages that don't exist
    try:
        df = pd.DataFrame(revisions_data['pages'][page_id]['revisions'])
    except KeyError:
        print u"{0} doesn't exist!".format(article)
        return None # bail out rather than continuing with an undefined df
    # Make it clear what's being edited
    df['page'] = [article]*len(df)
    # Clean up timestamps
    df['timestamp'] = pd.to_datetime(df['timestamp'],format="%Y-%m-%dT%H:%M:%SZ",unit='s')
    # Clean up anon column. If/else for articles that have all non-anon editors
    if 'anon' in df.columns:
        df = df.replace({'anon':{np.nan:False,u'':True}})
    else:
        df['anon'] = [False] * len(df)
    # Sort the data on timestamp and reset the index
    df = df.sort('timestamp').reset_index(drop=True)
    df.index.name = 'revision'
    df = df.reset_index()
    # Set the index to a MultiIndex
    df.set_index(['page','revision'],inplace=True)
    # Compute additional features
    df['date'] = df['timestamp'].apply(lambda x:x.date())
    df['diff'] = df['size'].diff()
    df['unique_users'] = count_unique_users(df['user'])
    df['latency'] = [round(i/np.timedelta64(1,'s'),-1) + 1 for i in df['timestamp'].diff().values]
    # Don't return random other columns
    df = df[[u'anon',u'comment',u'parentid',
             u'revid',u'size',u'timestamp',
             u'user',u'userid',u'unique_users',
             u'date', u'diff', u'latency']]
    return df
Try this out on "Bill Clinton".
In [79]:
bc_rv_df = get_revision_df("Bill Clinton")
bc_rv_df.head()
We've created a DataFrame for both Hillary's revision history (hrc_rv_df) as well as Bill's revision history (bc_rv_df). We can now combine both of these together (cross your fingers!!!) using the concat method. We can check that they both made it into the combined DataFrame by checking the first level of the index, and we see they're both there. We also save all the data we've scraped and cleaned to disk --- the resulting file takes up just under 5 MB.
In [321]:
clinton_df = pd.concat([bc_rv_df,hrc_rv_df])
print clinton_df.index.levels[0]
print "There are a total of {0} revisions across both the Hillary and Bill Clinton articles.".format(len(clinton_df))
clinton_df.to_csv('clinton_revisions.csv',encoding='utf8')
We are going to use these data to create a coauthorship network of all the editors who contributed to both these articles. If we've already crawled this data, we can just load it from disk, specifying options to make sure we have the right encoding, the columns are properly indexed, and the dates are parsed.
In [41]:
clinton_df = pd.read_csv('clinton_revisions.csv',
encoding='utf8',
index_col=['page','revision'],
parse_dates=['timestamp','date'])
clinton_df.head()
Out[41]:
We want to create an "edgelist" that contains all the (editor, article) pairs recording who contributed to which articles. This could be done by looping over the list, but that is inefficient on larger datasets like the one we crawled. Instead, we'll use a groupby approach to not only count the number of times an editor contributed to an article (the weight we defined previously), but a whole host of other potentially interesting attributes.
We use the agg method on the data that's been grouped by page and user to aggregate the information into nice summary statistics. We count the number of revisions using len and relabel this variable weight. For the timestamp, diff, latency, and revision variables, we compute new summary statistics for the minimum, median, and maximum values. This operation returns a new DataFrame, indexed by (page, user), with columns corresponding to labels like weight, ts_min, etc. Each row in this DataFrame will become edge attributes in the graph object we make below. The operation also creates an awkward MultiIndex on the columns, so we drop the redundant 0-level to get back to a single set of concise column names.
We're going to do something different with the timestamp data because these values are stored as Timestamp objects that don't always play nicely with other functions. Instead, we're going to convert them to counts of the amount of time (in days) since January 16, 2001, the date that Wikipedia was founded. In effect, we're counting how "old" Wikipedia was when an action occurred, and this float count will work better in subsequent steps.
In [43]:
clinton_gb_edge = clinton_df.reset_index().groupby(['page','user'])
clinton_edgelist = clinton_gb_edge.agg({'revid':{'weight':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_max':np.max}
})
# Drop the legacy/redundant column names
clinton_edgelist.columns = clinton_edgelist.columns.droplevel(0)
# Convert the ts_min and ts_max to floats for the number of days since Wikipedia was founded
clinton_edgelist['ts_min'] = (clinton_edgelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_edgelist['ts_max'] = (clinton_edgelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_edgelist.head()
Out[43]:
The nodes in this bipartite network also have attributes we can extract from the data. Remember, because this is a bipartite network, we'll need to generate attribute data for both the users and the pages. We can perform an analogous groupby
operation as we used above, but simply group on either the user
or the page
values. After each of these groupby
operations, we can perform similar agg
operations to aggregate the data into summary statistics. In the case of the user, these summary statistics are across all articles in the data. Thus the clinton_usernodelist
summarizes how many total edits a user made, their first and last observed edits, and the distribution of their diff
, latency
, and revision
statistics. The clinton_pagenodelist
summarizes how many total edits were made to the page, the date of the first and last edit, and so on.
In [44]:
# Create the usernodelist by grouping on user and aggregating
clinton_gb_user = clinton_df.reset_index().groupby(['user'])
clinton_usernodelist = clinton_gb_user.agg({'revid':{'revisions':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_median':np.median,'revision_max':np.max}
})
# Clean up the columns and convert the timestamps to counts
clinton_usernodelist.columns = clinton_usernodelist.columns.droplevel(0)
clinton_usernodelist['ts_min'] = (clinton_usernodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_usernodelist['ts_max'] = (clinton_usernodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
# Create the pagenodelist by grouping on page and aggregating
clinton_gb_page = clinton_df.reset_index().groupby(['page'])
clinton_pagenodelist = clinton_gb_page.agg({'revid':{'revisions':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_median':np.median,'revision_max':np.max}
})
# Clean up the columns and convert the timestamps to counts
clinton_pagenodelist.columns = clinton_pagenodelist.columns.droplevel(0)
clinton_pagenodelist['ts_min'] = (clinton_pagenodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_pagenodelist['ts_max'] = (clinton_pagenodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_pagenodelist.head()
Out[44]:
Now that we've created all this rich contextual data about edges, pages, and editors, we can load it all into a NetworkX
DiGraph
object called clinton_g
. We start by looping over the index in the clinton_edgelist
dataframe that corresponds to the edges in the network, convert the edge attributes to a dictionary for NetworkX
to better digest, and then add this edge and all its data to the clinton_g
graph object. This creates placeholder nodes, but we want to add the rich node data we created above as well. We can loop over the clinton_usernodelist
, convert the node attributes to a dictionary, and then overwrite the placeholder nodes by adding the data-rich user nodes to the clinton_g
graph object. We do the same for the clinton_pagenodelist
, then check the number of nodes and edges in the network, and finally print out a few examples of the data-rich nodes and edges.
In [45]:
clinton_g = nx.DiGraph()
# Add the edges and edge attributes
for (article,editor) in iter(clinton_edgelist.index.values):
    edge_attributes = dict(clinton_edgelist.ix[(article,editor)])
    clinton_g.add_edge(editor,article,edge_attributes)
# Add the user nodes and attributes
for node in iter(clinton_usernodelist.index):
    node_attributes = dict(clinton_usernodelist.ix[node])
    clinton_g.add_node(node,node_attributes)
# Add the page nodes and attributes
for node in iter(clinton_pagenodelist.index):
    node_attributes = dict(clinton_pagenodelist.ix[node])
    clinton_g.add_node(node,node_attributes)
print "There are {0} nodes and {1} edges in the network.".format(clinton_g.number_of_nodes(),clinton_g.number_of_edges())
clinton_g.edges(data=True)[:3]
Out[45]:
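As a quick sanity check (a sketch, not part of the original analysis), the in-degree of each article node in this bipartite graph should equal the number of distinct editors we aggregated for it, since every edge points from an editor to an article:
# In-degree of each article node = number of distinct editors who touched it
clinton_g.in_degree([u'Bill Clinton', u'Hillary Rodham Clinton'])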
Now it's time to do a really audacious data scrape. We're going to get the revision histories for all 2,646 articles linked to and from Hillary's article. The data will be stored in the dataframe_dict
dictionary that will be keyed by article title and the values will be the dataframes themselves. We'll use a for
loop to go over every article in the all_links
and call the get_revision_df
function we defined and tested above to get the cleaned revision DataFrame and store it in the dataframe_dict
object. Because this scrape may take a while, we're going to put in some exception handling (try, except) so that if an error occurs, we don't lose all our progress. When an exception occurs, we'll add the article name to the errors
list so we can go back and check what happened.
We'll concatenate all these DataFrames together into a gigantic DataFrame containing all the data we've scraped and then save it. This is a 485 MB file!
This will take a long time and a lot of memory!!! To prevent you from accidentally executing this, the block below is in a "raw" format that you'll need to convert to "Code" from the dropdown above.
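The raw block itself isn't rendered here, but based on the description above it would look roughly like the sketch below (the gigantic_df.csv filename matches the file loaded later; treat this as a reconstruction rather than the author's original code):
# Sketch of the raw scrape block -- convert to Code only if you really mean to run it
dataframe_dict = {}
errors = []
for article in all_links:
    try:
        dataframe_dict[article] = get_revision_df(article)
    except Exception:
        errors.append(article) # note the article and keep going
# Concatenate everything into one gigantic DataFrame and save it to disk
gigantic_df = pd.concat(dataframe_dict.values())
gigantic_df.to_csv('gigantic_df.csv',encoding='utf8')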
And there are nearly 3 million revisions in the dataset!
In [338]:
len(gigantic_df)
Out[338]:
The analysis can start again here by loading the CSV file rather than having to re-scrape the data from above. Loading the file to gigantic_df
, there are a few rows that seem to be broken, so we'll use drop
to remove them. We also use to_datetime
to make sure the timestamp information is using the appropriate units.
In [46]:
gigantic_df = pd.read_csv('gigantic_df.csv',
encoding='utf8',
index_col=['page','revision'],
parse_dates=['timestamp','date']
)
gigantic_df = gigantic_df.drop(("[[History of the United States]] at [[History of the United States#British colonization|British Colonization]]. ([[WP:TW|TW]])",589285361))
gigantic_df = gigantic_df.drop(("United States",32868))
gigantic_df['timestamp'] = pd.to_datetime(gigantic_df['timestamp'],unit='s')
gigantic_df['date'] = pd.to_datetime(gigantic_df['date'],unit='d')
gigantic_df.head()
Out[46]:
Now do all the groupby
and agg
operations to create the edgelists and nodelists we'll need to make a network as well as the data cleanup steps we did above.
In [47]:
edge_agg_function = {'revid':{'weight':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_max':np.max}
}
# Create the edgelist by grouping on both page and user
gigantic_gb_edge = gigantic_df.reset_index().groupby(['page','user'])
gigantic_edgelist = gigantic_gb_edge.agg(edge_agg_function)
# Drop the legacy/redundant column names
gigantic_edgelist.columns = gigantic_edgelist.columns.droplevel(0)
# Convert the ts_min and ts_max to floats for the number of days since Wikipedia was founded
gigantic_edgelist['ts_min'] = (gigantic_edgelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_edgelist['ts_max'] = (gigantic_edgelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
print "There are {0} edges in the network.".format(len(gigantic_edgelist))
In [48]:
node_agg_function = {'revid':{'revisions':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_median':np.median,'revision_max':np.max}
}
# Create the usernodelist by grouping on user and aggregating
gigantic_gb_user = gigantic_df.reset_index().groupby(['user'])
gigantic_usernodelist = gigantic_gb_user.agg(node_agg_function)
# Clean up the columns and convert the timestamps to counts
gigantic_usernodelist.columns = gigantic_usernodelist.columns.droplevel(0)
gigantic_usernodelist['ts_min'] = (gigantic_usernodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_usernodelist['ts_max'] = (gigantic_usernodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
print "There are {0} editor nodes in the network.".format(len(gigantic_usernodelist))
gigantic_usernodelist.head()
Out[48]:
In [49]:
# Create the pagenodelist by grouping on page and aggregating
gigantic_gb_page = gigantic_df.reset_index().groupby(['page'])
gigantic_pagenodelist = gigantic_gb_page.agg(node_agg_function)
# Clean up the columns and convert the timestamps to counts
gigantic_pagenodelist.columns = gigantic_pagenodelist.columns.droplevel(0)
gigantic_pagenodelist['ts_min'] = (gigantic_pagenodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_pagenodelist['ts_max'] = (gigantic_pagenodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
print "There are {0} page nodes in the network.".format(len(gigantic_pagenodelist))
gigantic_pagenodelist.head()
Out[49]:
Having created the edge and node lists in the previous step, we can now add these data to a NetworkX DiGraph object we'll call gigantic_g. As before, we add the edges and edge attributes from gigantic_edgelist and then add the nodes and node attributes from gigantic_usernodelist and gigantic_pagenodelist. We use a dictionary comprehension to convert the attribute values to the float data type rather than numpy.float64, which doesn't play nicely with the graph writing functions in NetworkX. And then we can do the "grand reveal" to describe the coauthorship network of the articles in the hyperlink network neighborhood of Hillary's article.
In [70]:
gigantic_g = nx.DiGraph()
# Add the edges and edge attributes
for (article,editor) in iter(gigantic_edgelist.index.values):
    edge_attributes = dict(gigantic_edgelist.ix[(article,editor)])
    edge_attributes = {k:float(v) for k,v in edge_attributes.iteritems()}
    gigantic_g.add_edge(editor,article,edge_attributes)
# Add the user nodes and attributes
for node in iter(gigantic_usernodelist.index):
    node_attributes = dict(gigantic_usernodelist.ix[node])
    node_attributes = {k:float(v) for k,v in node_attributes.iteritems()}
    gigantic_g.add_node(node,node_attributes)
# Add the page nodes and attributes
for node in iter(gigantic_pagenodelist.index):
    node_attributes = dict(gigantic_pagenodelist.ix[node])
    node_attributes = {k:float(v) for k,v in node_attributes.iteritems()}
    gigantic_g.add_node(node,node_attributes)
print "There are {0} nodes and {1} edges in the network.".format(gigantic_g.number_of_nodes(),gigantic_g.number_of_edges())
gigantic_g.edges(data=True)[:3]
Out[70]:
Finally, having gone through all this effort to make a co-authorship network with such rich attributes and complex properties, we should save our work. There are many different file formats for storing network objects to disk, but the two I use the most are "graphml" and "gexf". They do slightly different things, but they're generally interoperable and compatible with many programs for visualizing networks like Gephi.
In [71]:
nx.write_graphml(gigantic_g,'gigantic_g.graphml')
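If you prefer the gexf format mentioned above, NetworkX has an analogous writer:
# Write the same graph in gexf format as well (also readable by Gephi)
nx.write_gexf(gigantic_g,'gigantic_g.gexf')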
Now let's perform some basic network analyses on this gigantic graph we've created. An extremely easy and important metric to compute is the degree centrality of nodes in the network: how well-connected a node is based on the number of edges it has to other nodes. We use the directed nature of the edges to distinguish between articles (which receive links in) and editors (which send links out) and compute the in- and out-degree centralities respectively with the nx.in_degree_centrality and nx.out_degree_centrality functions. These functions return a normalized degree centrality, where the values aren't the integer count of connected edges but rather the fraction of other nodes to which a node is connected. The values are returned in dictionaries keyed by node ID (article title or user name), which we save as g_idc and g_odc.
In [51]:
g_idc = nx.in_degree_centrality(gigantic_g)
g_odc = nx.out_degree_centrality(gigantic_g)
We can use a fancy bit of programming called itemgetter to quickly sort these dictionaries and return the 10 best-connected articles and users. Hillary, despite being the central node we started at, is not actually the best-connected article; other major people and entities are. The top editors, interestingly enough, are not actually people but automated bots who perform a variety of maintenance and cleanup tasks across articles.
In [58]:
sorted(g_idc.iteritems(), key=itemgetter(1),reverse=True)[:10]
Out[58]:
In [53]:
sorted(g_odc.iteritems(), key=itemgetter(1),reverse=True)[:10]
Out[53]:
We can plot a histogram of connectivity patterns for the articles and editors, which shows a very skewed distribution: most editors edit only a single article while there are single editors who make thousands of contributions. The distribution for articles shows a less severe but still very long-tailed distribution of contribution patterns.
In [54]:
g_size = gigantic_g.number_of_nodes()
g_idc_counter = Counter([v*(g_size-1) for v in g_idc.itervalues() if v != 0])
g_odc_counter = Counter([v*(g_size-1) for v in g_odc.itervalues() if v != 0])
plt.scatter(g_idc_counter.keys(),g_idc_counter.values(),s=50,c='b',label='Articles')
plt.scatter(g_odc_counter.keys(),g_odc_counter.values(),s=50,c='r',label='Editors')
plt.yscale('log')
plt.xscale('log')
plt.xlabel('Number of connections',fontsize=15)
plt.ylabel('Number of nodes',fontsize=15)
plt.legend(loc='upper right',scatterpoints=1)
Out[54]:
We can also look at the distribution of edge weights, or the number of times that an editor contributed to an article. We could do this using the gigantic_edgelist DataFrame, but let's practice using the data attributes we've stored in the graph object. Using a list comprehension as before, we iterate over the edges (note the use of edges_iter(data=True), which is more memory efficient and returns the edge attributes), each of which comes back as a tuple (i, j, attributes_dict). We access each tuple's weight and store it in the list weights, proceed with a Counter operation, and then plot the results.
In [55]:
weights = [attributes['weight'] for i,j,attributes in gigantic_g.edges_iter(data=True)]
weight_counter = Counter(weights)
plt.scatter(weight_counter.keys(),weight_counter.values(),s=50,c='b',label='Weights')
plt.yscale('log')
plt.xscale('log')
plt.xlabel('Number of contributions',fontsize=15)
plt.ylabel('Number of edges',fontsize=15)
Out[55]:
We can compute another degree-related metric called "assortativity" that measures how well connected your neighbors are on average. We compute this statistic on the set of article and editor nodes using the nx.assortativity.average_degree_connectivity
function with special attention to the direction of the ties as well as limiting the nodes to those in the set of pages or users, respectively. Plotting the distribution, both the articles and the editors exhibit negative correlations. In other words, for those editors (articles) connected with few articles (editors), those articles (editors) have a tendency to be well-connected to other nodes in the set. Conversely, for those editors (articles) connected with many articles (editors), those articles (editors) have a tendency to be poorly connected to other nodes in the set. Articles exhibit a stronger correlation than editors.
In [56]:
article_nn_degree = nx.assortativity.average_degree_connectivity(gigantic_g,source='in',target='out',nodes=gigantic_pagenodelist.index)
editor_nn_degree = nx.assortativity.average_degree_connectivity(gigantic_g,source='out',target='in',nodes=gigantic_usernodelist.index)
plt.scatter(article_nn_degree.keys(),article_nn_degree.values(),s=50,c='b',label='Articles',alpha=.5)
plt.scatter(editor_nn_degree.keys(),editor_nn_degree.values(),s=50,c='r',label='Editors',alpha=.5)
plt.yscale('log')
plt.xscale('log')
plt.xlabel('Degree',fontsize=15)
plt.ylabel('Average neighbor degree',fontsize=15)
plt.legend(loc='upper right',scatterpoints=1)
Out[56]:
Just trying a few other things: the cells below check whether an edge's weight is related to the difference in degree centrality between its article and editor endpoints.
In [72]:
gigantic_g.edges(data=True)[1]
Out[72]:
In [75]:
edge_weight_centrality = [list(),list()]
for (i,j,attributes) in gigantic_g.edges_iter(data=True):
    edge_weight_centrality[0].append(g_idc[j] - g_idc[i])
    edge_weight_centrality[1].append(attributes['weight'])
In [77]:
plt.scatter(edge_weight_centrality[0],edge_weight_centrality[1])
plt.yscale('log')
Throughout this section, we've gotten the links for a single article at a time. As we did with the user information in the previous section, we can wrap these queries in a function so that they're easier to run. Once we do this, we can do more interesting things like examine the hyperlink ego-network surrounding a single article.
We take the lists of linked articles we extracted from Hillary's article and iterate over them, getting the lists of links for each of them. We need to place the output into a larger data object that will hold everything. I'll use a dictionary keyed by article name that returns a dictionary containing the lists of links for that article. We'll put Hillary's data in there to start it up, and then add more.
Next we come up with the list of articles we're going to iterate over. We could just add the outlist and inlist articles together, but there might be redundancies in there. Instead we'll cast these lists into sets containing only unique article names, and the union of these sets creates a master set of all unique article names. Then we convert this joined set back into a list called all_links so we can iterate over it.
This set of unique links has 2,646 articles in it, which will take some time to scrape. This may take over an hour to run and will generate ~190MB of data: convert the cell below back to "Code" if you really want to execute it.
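The scrape cell itself is kept out of the executable flow for that reason, but a sketch of what it might contain, based on the description above, follows (link_dict is a hypothetical name for the dictionary keyed by article):
# Sketch only: build the master list of unique neighboring articles and crawl their links
all_links = list(set(hrc_alllink_outlist) | set(hrc_alllink_inlist))
link_dict = {u'Hillary Rodham Clinton': {'In': hrc_alllink_inlist,
                                         'Out': hrc_alllink_outlist}}
for article in all_links:
    try:
        out_links, in_links = get_article_links(article) # one pair of queries per article
        link_dict[article] = {'In': in_links, 'Out': out_links}
    except Exception:
        print u"Problem getting links for {0}".format(article)
The dtype_dict defined in the cell below maps the revision DataFrame's columns to their expected types; presumably it is meant for enforcing those types when the scraped revision data is re-loaded from disk.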
In [ ]:
dtype_dict = {'page':unicode,
'revision':np.int64,
'anon':bool,
'comment':unicode,
'parentid':np.int64,
'size':np.int64,
'timestamp':unicode,
'user':unicode,
'userid':np.int64,
'unique_users':np.int64,
'date':unicode,
'diff':np.float64,
'latency':np.float64
}